NTU-MC Toolkit: Annotating a Linguistically Diverse Corpus

Authors

Liling Tan

Francis Bond

Abstract

The NTU-MC Toolkit is a compilation of tools to annotate the Nanyang Technological University Multilingual Corpus (NTU-MC). The NTU-MC is a parallel corpus of linguistically diverse languages (Arabic, English, Indonesian, Japanese, Korean, Mandarin Chinese, Thai and Vietnamese). The NTU-MC thrives on the mantra of "more data is better data and more annotation is better information". Beyond increasing the amount of parallel data from diverse language pairs, annotating the corpus with various layers of information allows corpus linguists to discover linguistic phenomena and provides computational linguists with pre-annotated features for various NLP tasks. In addition to the agglomeration of existing tools into a single Python wrapper library, we have implemented three tools (Mini-segmenter, GaChalign and Indotag) that (i) provide users with varying analyses of the corpus, (ii) improve on state-of-the-art performance and (iii) reimplement a previously unavailable annotation tool as a free and open tool. This paper briefly describes the wrapper classes available in the toolkit, and introduces and demonstrates the usage of the Mini-segmenter, GaChalign and Indotag.

Similar resources

Tan Liling and Francis Bond. Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)

The NTU-MC compilation taps on the linguistic diversity of multilingual texts available within Singapore. The current version of NTU-MC contains 375,000 words (15,000 sentences) in 6 languages (English, Chinese, Japanese, Korean, Indonesian and Vietnamese) from 6 language families (Indo-European, Sino-Tibetan, Japonic, Korean as a language isolate, Austronesian and Austro-Asiatic). The NTU-MC i...

Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)

The NTU-MC compilation taps on the linguistic diversity of multilingual texts available within Singapore. The current version of NTU-MC contains 595,000 words (26,000 sentences) in 7 languages (Arabic, Chinese, English, Indonesian, Japanese, Korean and Vietnamese) from 7 language families (Afro-Asiatic, Sino-Tibetan, Indo-European, Austronesian, Japonic, Korean as a language isolate and Austro-...

Meeting Structure Annotation: Annotations Collected with a General Purpose Toolkit

We describe a generic set of tools for representing, annotating, and analyzing multi-party discourse, including: an ontology of multimodal discourse, a programming interface for that ontology, and NOMOS – a flexible and extensible toolkit for browsing and annotating discourse. We describe applications built using the NOMOS framework to facilitate a real annotation task, as well as for visualizi...

Meeting Structure Annotation: Data and Tools

We present a set of annotations of hierarchical topic segmentations and action item subdialogues collected over 65 meetings from the ICSI and ISL meeting corpora, designed to support automatic meeting understanding and analysis. We describe an architecture for representing, annotating, and analyzing multi-party discourse, including: an ontology of multimodal discourse, a programming interface f...

The MC-value for monotonic NTU-games

The MC-value is introduced as a new single-valued solution concept for monotonic NTU-games. The MC-value is based on marginal vectors, which are extensions of the well-known marginal vectors for TU-games and hyperplane games. As a result of the definition it follows that the MC-value coincides with the Shapley value for TU-games and with the consistent Shapley value for hyperplane games. It is ...

Publication date: 2014
